Introduction

Alzheimer’s disease (AD) effects approximately 5.8 million people in the the United States ages 65 and older, with early sign and symptoms of the disease centering around forgetfulness, followed by impairment of memory until loss of ability to carry out everyday tasks. There is currently no treatment that cures AD or alters the disease process in the brain. Mayo Clinic.

Recent genetic studies of humans have identified brain specific myeloid cells (microglia) as a potential key cell type controlling an individual’s risk of acquiring Alzheimer’s Disease (AD):

In this report, I will be performing a differential expression (DE) analysis of RNA-seq data that has been derived from both Alzheimer’s patient and control individual Microglia. This data is associated with the publication: Alzheimer’s Patient Microglia Exhibit Enhanced Aging and Unique Transcriptional Activation, of which the graphical abstract is below:

This study examined the expression of twenty-five genes known or proposed to be associated with AD risk or progression and found that most AD risk genes in their purified frozen brain tissue cell types showed preferential expression in microglia compared with other brain cell types. They have also examined whether any of these genes displayed altered expression levels in AD versus control cells, and observed that APOE, ABCA7, GPR141, PTK2B, SPI1, and ZYX appeared upregulated in AD microglia, whereas MEF2C appeared downregulated (un-adjusted p value < 0.05). The indication of upregulation of APOE is specifically interesting, as previous studies have indicated to the potential importance of the gene APOE ε4: the most prevalent yet understudied risk factor for Alzheimer’s disease.

My aim specifically is to recapitulate the author’s results, confirming that the prior mentioned set (along with the total set of 56 identified differentially expressed genes) is indeed upregulated in AD patient microglia in comparison to controls. Throughout this analysis I intend to hold this data to high quality control standards in order to ensure the correctness of the analysis findings. In order for this report to be reproducible and easily understood, I will make use of the industry-standard splice-aware RNA-seq aligner STAR, and differential expression profiling tool DESeq2.

Results

Differential Expression Volcano Plot

Differentially Expressed Gene List

Of the 56 genes identified in my hypothesis, I was only able to confirm 14 gene as differentially expressed in my final results of this report. However, these genes are also a set of the 25 literature-supported genes identified in this study as well, raising my confidence in their true differential nature. A table containing the confirmed genes with their multiple-testing corrected p-value as well as log_2 fold change of expression can be found below. An extended list of genes found to be differentially expressed in this report can be found in the Differential Expression Analysis section of this report.

kable(study_de_genes_df)
padj log2FoldChange
SERPINF1 0.000041 1.7407
CECR2 0.000041 1.6923
TGFBI 0.004718 -1.9188
GLT1D1 0.005319 2.4133
ARSA 0.019995 -1.2752
PTPRG 0.019995 -1.7361
S100A4 0.019995 -3.5340
APOE 0.019995 -1.5115
MOV10L1 0.037131 2.5397
ASTN1 0.041758 1.4381
DPYD 0.042672 -1.8551
STEAP3 0.044855 -1.4818
IGSF10 0.048952 1.5660
PSTPIP1 0.048952 -1.8852
MEIS1 0.059128 1.6870
RFX2 0.059128 -1.5431
TNFRSF21 0.091389 1.9729
SELENBP1 0.091389 1.5837
PLXNC1 0.093845 -1.2515
SMIM3 0.107681 -1.6002
KCNJ5 0.233903 -1.3691
NIN 0.240712 0.6251
TLN2 0.240922 0.7447
ACD 0.240922 -1.0323
LSR 0.272270 -1.0618
VENTX 0.302060 -0.9980
SECTM1 0.351846 -2.1158
RIMS2 0.361383 1.0546
IL15 0.365492 -0.8582
ZNF662 0.379506 0.7173
PTPRZ1 0.382818 1.2478
FOXP1 0.405825 -0.6839
ZBTB8B 0.428732 1.6661
GRIA2 0.440109 1.2523
ZNF532 0.446205 0.6550
ADAM8 0.461316 -1.3041
TSHZ3 0.504754 -1.0101
TTYH3 0.508623 -0.8394
FBRSL1 0.510608 -0.6324
EMP2 0.539124 -0.9036
SMAD7 0.549781 -0.6461
RUNX3 0.553558 -0.7312
ULK3 0.561052 -0.4504
CBX6 0.567129 -0.4916
SLC38A7 0.588293 -0.4095
CLDN15 0.664548 -0.6452
GYPC 0.701038 -0.5979
ADAMTS13 0.735986 -0.7492
GAS2L1 0.827672 -0.4072
ATOH8 0.828134 -0.6446
TM9SF1 0.878700 -0.4394
ZNF703 0.886590 -0.4055
A1BG 0.896340 -0.4483
ZNF696 0.912330 -0.3390
ZNF843 0.982388 -0.0933
CHCHD5 0.992804 -0.0474

Possible Future Experiments and Analyses:

After performing this analysis, I have identified a few areas that could be improved or extended upon.

The study in question offers 4 total bulk-sorted cell types, and in this report I have chosen to only analyze the Myeloid cell type. Thus, in the future it might be interesting to analyze the other cell type data (Astrocytes, Endothelial, Neurons). As well as analyzing more cell types, more background-literature genes of interest could be integrated into this analysis, in order to test for differential expression in this dataset.

As well as possible further analysis with the dataset in hand, there is room for improvement in the quality of collected data as well. Due to the nature of this dataset being generated from frozen tissue, large amounts of long-non-coding RNAs and Mitochondrial RNAs were detected, causing the need for filtering by only protein-coding genes. This is obviously not the ideal setting, so there is certainly room for improvement of analyzing fresher tissue datasets, or more well-preserved datasets. This is a challenge though as large, well annotated, fresh, human cohort data has not been practical to collect in the past.

Methods

Downloads

Data Background

Information the fastq files used in this study can be found here as well as NCBI GEO

Downloading Myeloid Data

For this analysis, I have chosen to proceed with one of the four available cell types, being myeloid / microglia cells. However, the script below can be easily be adapted to download all cell_types with a simple for loop.

alzheimers_dataset="/Genomic_Data_Storage/Alzheimers_Project"
mkdir ${alzheimers_dataset}/accession_tables

wget https://raw.githubusercontent.com/jakesauter/Next_Gen_Sequencing_Project/main/accession_tables/Sra_run_table.txt \
    --output-document="${alzheimers_dataset}/accession_tables/Sra_run_table.txt"

unique_cell_types=$(cat ${alzheimers_dataset}/accession_tables/Sra_run_table.txt | cut -d',' -f9 | sort | uniq)


cell_type='myeloid'
echo "Downloading dat for cell type: ${cell_type}"
output_dir=${alzheimers_dataset}/${cell_type}_fastq_files
acc_nums=$(cat ${alzheimers_dataset}/accession_tables/Sra_run_table.txt | grep $cell_type | cut -d',' -f1)

for acc_num in "${acc_nums[@]}" ; do 
  fasterq-dump --outdir $output_dir $acc_num
  gzip ${output_dir}/${acc_num}.fastq

Note that for microglia data specifically, author’s have included the following table, found in the Experimental model and subject details section of the publication:

Downloading STAR resources

In order to generate a STAR index for later mapping, we will first need to download the human reference genome, as well as the human reference annotation of genes that is compatible with this version of the human reference genome. More background on STAR is provided later in this report.

mkdir human_genome_38
cd human_genome_38
wget ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_36/gencode.v36.annotation.gtf.gz
gunzip * 

Now that we have both the reference genome and our features of interest, we can generate our STAR index using the command below, saving our index to the --genomeDir directory of human_genome_38_STAR_index.

STAR --runMode  genomeGenerate \
--runThreadN 16 \
--genomeDir human_genome_38_STAR_index \
--genomeFastaFiles human_genome_38/hg38.fa \
--sjdbGTFfile human_genome_38/gencode.v36.annotation.gtf.gz \
--sjdbOverhang 49 

Data Processing

FastQC and MultiQC

FastQC is an industry standard Next-Generation Sequencing Quality Control tool written and maintained by the Bioinformatics Group at the Babraham Institute.

We can see the goal and mission of FastQC from an excerpt on their website:

FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.

FastQC in the following analyses will be used for providing a quick, easy to generate overview of FastQ file quality to indicate potential problem areas in the project data, in the form of HTML based permanent reports, containing summary graphs and tables describing different aspects of the input data.

Example quality control passing and quality control failing can be found on FastQC’s website as well.

MultiQC is a useful tool to use in this context as well, as it is a tool that aggregates results from analyses across many samples (processed by FastQC in this case) into a single report. As seen below, the MultiQC command is very simple, and if all FastQC results are located in the same directory, that directory can be provided as input for MultiQC to aggregate all reports in the directory.

Key FastQC and MultiQC Commands

fastqc $fastq_file --noextract \
     --outdir=$fastqc_outdir \
     --threads=14 \
     $fastq_files

multiqc $fastqc_outdir

TrimGalore

Found later in this report, after observing the first application of FastQC to the raw FastQ data files, highly repeated sequences were found. In order to both detect and remove these sequences from the raw input FastQ data, we can use the TrimGalore. Also written an maintained by the Bioinformatics Group at the Babraham Institute.

TrimGalore is a wrapper tool, providing the functionality of both Cutadapt and FastQC to provided adapter trimming and quality control to FastQ files. This report does not make use of the FastQC functionality of TrimGalore in order to separate the QC and raw data files in an easier to understand way, calling FastQC directly with the required arguments.

Key TrimGalore Commands

$trim_galore $fastq_files \
--output_dir="${trim_galore_outdir}" \
--cores 14 --phred33 --gzip 

STAR

With over 16,787 citations on Google Scholar, STAR (Spliced Transcripts Alignment to a Reference) has proven to be a great tool for RNA-seq alignment.

It is best to first assemble a reference built upon your organism - gene model of interest with STAR --runMode genomeGenerate, followed by mapping your RNA-seq reads with STAR --runMode alignReads. During alignment, STAR takes as input single-end or pair-end fastq files, with the output file being in SAM (Spliced Alignment/Mapping) format. The sam file should then usually be converted into BAM (Binary Alignment/Mapping) file, a binary file of sam, for shortening computation time in further analyzing differential expression within RNA-seq data.

Key STAR Commands:

Earlier in this report, we have generated our STAR reference index with the following command:

STAR --runMode  genomeGenerate \
--runThreadN 16 \
--genomeDir human_genome_38_STAR_index \
--genomeFastaFiles human_genome_38/hg38.fa \
--sjdbGTFfile human_genome_38/gencode.v36.annotation.gtf.gz \
--sjdbOverhang 49 

Given the raw or adapter-trimmed FastQ files, as well as the reference index we have generated in the previous command, we can align reads in the following way:

STAR --runMode  alignReads \
  --runThreadN 12 \
  --genomeDir  human_genome_38_STAR_index \
  --readFilesIn  $fastq_file \
  --readFilesCommand  zcat \
  --outSAMtype  BAM  SortedByCoordinate \
  --outFileNamePrefix $prefix  \
  --outSAMattributes All \
  --outReadsUnmapped="Fastx" \
  --outTmpKeep="None"

featureCounts

featureCounts is an “efficient general purpose program for assigning sequence reads to genomic features” and serves as a great software program developed for counting reads to genomic features such as genes, exons, promoters and genomic bins

Key featureCounts Command

featureCounts \
-a $annot_file \
-o $out_file \
-T 14 $bam_files

Data Processing Script

In the script below, once all of the files are trimmed and saved, FastQC (paired with multiqc aggregation) is used to facilitate sequence quality control.

#!/bin/bash 

data_storage_dir="/Genomic_Data_Storage/Alzheimers_Project"
trim_galore="/home/t730/TrimGalore/TrimGalore-0.6.6/trim_galore"

cell_type="myeloid"
fastq_files=$(ls "${data_storage_dir}/${cell_type}_fastq_files"/*.fastq.gz)

trim_galore_outdir="${data_storage_dir}/trimmed_${cell_type}_fastqs"
mkdir $trim_galore_outdir

fastqc_outdir="${data_storage_dir}/trimmed_${cell_type}_fastqc"
mkdir $fastqc_outdir

$trim_galore $fastq_files \
--output_dir="${trim_galore_outdir}" \
--cores 14 --phred33 --gzip 

trimmed_files=$(ls $trim_galore_outdir/*.fq.gz)

fastqc $trimmed_files \
--outdir=$fastqc_outdir \
--threads=14 --noextract

cd $fastqc_outdir
multiqc .

for fastq_file in $trimmed_files; do
  base=$(basename $fastq_file)
  prefix=$(echo $fastq_file | egrep -o "SRR[0-9]+")
  prefix="trimmed_${cell_type}_alignments/${prefix}_trimmed."
  
  STAR --runMode  alignReads \
  --runThreadN 12 \
  --genomeDir  human_genome_38_STAR_index \
  --readFilesIn  $fastq_file \
  --readFilesCommand  zcat \
  --outSAMtype  BAM  SortedByCoordinate \
  --outFileNamePrefix $prefix  \
  --outSAMattributes All \
  --outReadsUnmapped="Fastx" \
  --outTmpKeep="None"
  
  samtools index ${prefix}.Aligned.sortedByCoord.out.bam
done

annot_file=${data_storage_dir}/human_genome_38/gencode.v36.annotation.gtf
out_dir=${data_storage_dir}/featureCounts
alignment_dir=${data_storage_dir}/trimmed_${cell_type}_alignments
out_file=${out_dir}/featCounts_genes.txt
bam_files=$(ls ${alignment_dir}/SRR???????_trimmed.Aligned.sortedByCoord.out.bam)

mkdir $out_dir

featureCounts \
-a $annot_file \
-o $out_file \
-T 14 $bam_files

Trimmed MultiQC HTML Report

Quality Control

FastQC Raw Data QC

Signs of cyclic GC content for some runs are present in the dataset, however fastq tile coordinates are not available in the GEO dataset to allow for selective tile filtering (Github issue showing why SRA may not store this information). Particularly these issues were seen in three of the samples, with the effects occurring with varying degree. These samples were

An example of this failing behavior:

STAR Alignment QC

In order to determine the percentage of reads mapped, we have used the --outReadsUnmapped="Fastx" parameter while running STAR, and now can compare the number of reads in our mapped files, versus the number of reads in our unmapped files. A bash function for performing this operation (calculate_unampped_reads) is applied to all of the BAM files generated in the STAR alignment phase of the report.

#!/bin/bash

data_storage_dir=/Genomic_Data_Storage/Alzheimers_Project

trim_fastq_dir=${data_storage_dir}/trimmed_myeloid_fastqs
trim_align_dir=${data_storage_dir}/trimmed_myeloid_alignments
bam_files=$(ls ${data_storage_dir}/trimmed_myeloid_alignments/*.bam)
parallel_tmp_dir=$(mktemp -d)

calculate_unampped_reads() {
  base=$(basename $1)
  prefix=$(echo $base | egrep -o "SRR[0-9]+")
  
  unammped_reads_file=${trim_align_dir}/${prefix}_trimmed.Unmapped.out.mate1
  total_reads_file=${trim_fastq_dir}/${prefix}_trimmed.fq.gz
  num_lines=$(cat $unammped_reads_file | wc -l)
  num_total_lines=$(zcat $total_reads_file | wc -l)
  perc_unmapped_reads=$(echo -e "(${num_lines}/${num_total_lines})*100" | bc -l | xargs printf %.2f)
  echo -e "Percentage of unmapped reads for ${prefix}: ${perc_unmapped_reads}%" > ${parallel_tmp_dir}/${prefix}.out
}

for bam_file in $bam_files; do
  calculate_unampped_reads $bam_file &
done 

cat $parallel_tmp_dir/*.out

Results of executing this code can be found below.

Percentage of unmapped reads for SRR8440443: 5.90%
Percentage of unmapped reads for SRR8440447: 4.17%
Percentage of unmapped reads for SRR8440448: 4.04%
Percentage of unmapped reads for SRR8440449: 8.38%
Percentage of unmapped reads for SRR8440463: 33.79%
Percentage of unmapped reads for SRR8440477: 5.14%
Percentage of unmapped reads for SRR8440482: 6.54%
Percentage of unmapped reads for SRR8440484: 3.65%
Percentage of unmapped reads for SRR8440486: 8.13%
Percentage of unmapped reads for SRR8440488: 5.19%
Percentage of unmapped reads for SRR8440501: 6.98%
Percentage of unmapped reads for SRR8440511: 5.47%
Percentage of unmapped reads for SRR8440513: 6.79%
Percentage of unmapped reads for SRR8440514: 4.97%
Percentage of unmapped reads for SRR8440517: 5.76%
Percentage of unmapped reads for SRR8440518: 5.02%
Percentage of unmapped reads for SRR8440524: 11.28%
Percentage of unmapped reads for SRR8440525: 9.51%
Percentage of unmapped reads for SRR8440527: 7.05%
Percentage of unmapped reads for SRR8440529: 6.67%
Percentage of unmapped reads for SRR8440531: 10.64%
Percentage of unmapped reads for SRR8440536: 7.61%
Percentage of unmapped reads for SRR8440538: 7.43%
Percentage of unmapped reads for SRR8440539: 12.34%
Percentage of unmapped reads for SRR8440542: 8.16%

From the above output, we see that SRR8440463 as an unusually high amount of unmapped reads compared to all other samples in the study. When cross-referencing our previously generated MultiQC report from earlier, we can see that the GC content of this sample appears to be off compared to all other samples as well, possibly indicating contamination or other library quality issues.

Problematic cyclic GC samples from before: SRR8440524, SRR8440538, and SRR8440539 NOT SRR8440463 that we see with GC-rich contamination above.

FastQC Reports:

We will keep these QC issues in mind moving forward, and remove the samples before analysis in R.

Data Analysis

Below we will use the DESeq2 R package to make a DESeqDataSet from our resultant files from running featureCounts. At this step, I have chosen to remove our 4 samples with QC issues after preliminary analysis, as well as another 4 samples for having an undefined value (NA) for the sex attribute, a possible confounder analyzed in this report.

Required Libraries:

library(tidyr)        # pivot_longer
library(knitr)        # Rmarkdown chunk options and kable
library(dplyr)        # filter, summarise
library(DESeq2)       # EDS and DE
library(ggplot2)      # Plotting
library(stringr)      # String searching and extraction
library(magrittr)     # Piping
library(patchwork)    # Image layout
library(plyranges)    # gtf file manipulation
library(rtracklayer)  # gtf file manipulation

Importing featureCounts Data

read_counts <- read.table(
  '../featCounts/featCounts_genes.txt', 
  header = TRUE, stringsAsFactors = FALSE) 

orig_names <- colnames(read_counts)
colnames(read_counts) <- gsub(".*(SRR[0-9]+).*", "\\1", orig_names)
row.names(read_counts) <- make.names(read_counts$Geneid)
read_counts <- read_counts[,-c(1:6)]
read_counts <- as.matrix(read_counts)

Filtering failed QC Samples

We will now filter the samples previously seen to have low-quality aspects that have the possibility of adding unnecessary noise to our analyses. The samples SRR8440524, SRR8440538, and SRR8440539 will be removed due to cyclic gc-content findings, SRR8440463 will be removed due to GC-rich contamination, and finally SRR8440447 and SRR8440517 will be removed due to unknown sex of the sample (a possible confounder).

keep_samples <- 
  colnames(read_counts) %in% 
  c('SRR8440524', 'SRR8440463',
    'SRR8440538', 'SRR8440539', 
    "SRR8440484", "SRR8440488", 
    "SRR8440447", "SRR8440517"
    ) %>% 
  not()

read_counts <- 
  read_counts[ , keep_samples]

read_counts %>% 
  .[1:4, 1:5] %>% 
  kable()    
SRR8440443 SRR8440448 SRR8440449 SRR8440477 SRR8440482
ENSG00000223972.5 0 0 0 0 0
ENSG00000227232.5 0 1 0 0 0
ENSG00000278267.1 0 0 0 0 0
ENSG00000243485.5 0 0 0 0 0

Converting Gene symbols using GTF

Now that we have the read_counts for all of our samples of interest, we can associate our known metadata with the genes, which can later be incorporated into our DESeq2 object’s rowData. In order to associate these data, the gene information can first be read in the gencode.v36.annotation.gtf.gz file with plyranger::read_gff.

Seen later in this report, I will be filtering for only protein-coding genes, as well as Non-Mitochondrial RNA in order to achieve proper cross-sample normalization results. This result is achieved through filtering the data.frame derived from the GTF gene info file, followed by a left_join on the row names of our gene count information with the associated gene symbol.

genes_df <- 
  read_gff("../gtf/gencode.v36.annotation.gtf.gz", 
           genome_info="hg38") %>% 
  select(gene_id, gene_type, gene_name) %>% 
  as.data.frame()

gene_info_df <-
  data.frame(gencode_id = rownames(read_counts)) %>% 
  left_join(., genes_df, 
            by = c('gencode_id' = 'gene_id')) %>% 
  select(gencode_id, gene_name, gene_type) %>% 
  filter(gene_type == 'protein_coding',
         !str_detect(gene_name, '^MT-')) %>%
  distinct() 

read_counts <- 
  read_counts[gene_info_df$gencode_id, ]

Importing Sample Data

Now that we have our raw read counts and rowData metadata, lets gather information about our samples to supply to the colData of DESeq2 object.

accession_table <-
  read.table('../accession_tables/Sra_run_table.txt', 
             header = TRUE, sep = ',', stringsAsFactors = FALSE) %>% 
  as.data.frame() %>% 
  filter(Cell_type == 'myeloid', 
         !is.na(sex)) %>% 
  select(Run, Diagnosis, sex,
                expired_age, APOE, pmi, 
                Cell_type) %>%
  arrange(Diagnosis, sex)

accession_table %>% 
  kable()
Run Diagnosis sex expired_age APOE pmi Cell_type
SRR8440448 AD female 84 3/4 3.00 myeloid
SRR8440514 AD female 78 3/4 2.16 myeloid
SRR8440538 AD female 84 3/3 1.83 myeloid
SRR8440477 AD male 88 3/4 3.00 myeloid
SRR8440501 AD male 79 3/3 2.00 myeloid
SRR8440486 AD male 75 3/3 3.00 myeloid
SRR8440542 AD male 70 3/4 2.33 myeloid
SRR8440463 AD male 72 4/4 4.16 myeloid
SRR8440511 Control female 59 3/3 3.15 myeloid
SRR8440518 Control female 88 3/3 3.00 myeloid
SRR8440443 Control female >90 3/3 2.50 myeloid
SRR8440529 Control female 83 3/3 3.25 myeloid
SRR8440527 Control male 79 3/3 3.00 myeloid
SRR8440449 Control male 74 2/3 3.25 myeloid
SRR8440482 Control male >90 2/3 1.50 myeloid
SRR8440513 Control male >90 3/3 1.50 myeloid
SRR8440524 Control male 86 3/3 4.75 myeloid
SRR8440525 Control male 86 3/3 3.00 myeloid
SRR8440531 Control male 38 3/3 3.00 myeloid
SRR8440536 Control male 61 3/3 2.33 myeloid
SRR8440539 Control male >90 3/3 3.00 myeloid

From this accession table, we can map our accession ids to any clinical variable we like, such as Diagnosis, which is our main experimental group condition in this analysis. Diagnosis in our case tells us whether or not the diseased patient showed signs of Alzheimer’s disease or not (control).

Creating DESeq2 Object

conditions <-
  colnames(read_counts) %>% 
  lapply(function(run_name) 
      accession_table[accession_table['Run'] == run_name, ]) %>% 
  bind_rows() %>% 
  mutate(sex = factor(sex), 
         Diagnosis = factor(Diagnosis,
                            levels = c('Control', 'AD')))

sample_info <-
  DataFrame(conditions,
            row.names = conditions$Run) 


DESeq.ds <- 
  DESeqDataSetFromMatrix(
    countData = as.matrix(read_counts),
    colData = sample_info,
    rowData = gene_info_df, 
    design = ~ sex + Diagnosis)

Lets inspect our rowData, colData, and counts attributes that we have provided to our new object.

colData

Run Diagnosis sex expired_age APOE pmi Cell_type
SRR8440443 SRR8440443 Control female >90 3/3 2.50 myeloid
SRR8440448 SRR8440448 AD female 84 3/4 3.00 myeloid
SRR8440449 SRR8440449 Control male 74 2/3 3.25 myeloid
SRR8440477 SRR8440477 AD male 88 3/4 3.00 myeloid
SRR8440482 SRR8440482 Control male >90 2/3 1.50 myeloid
SRR8440486 SRR8440486 AD male 75 3/3 3.00 myeloid

rowData

gencode_id gene_name gene_type
ENSG00000186092.6 ENSG00000186092.6 OR4F5 protein_coding
ENSG00000284733.1 ENSG00000284733.1 OR4F29 protein_coding
ENSG00000284662.1 ENSG00000284662.1 OR4F16 protein_coding
ENSG00000187634.12 ENSG00000187634.12 SAMD11 protein_coding
ENSG00000188976.11 ENSG00000188976.11 NOC2L protein_coding
ENSG00000187961.14 ENSG00000187961.14 KLHL17 protein_coding

counts

SRR8440443 SRR8440448 SRR8440449 SRR8440477 SRR8440482
ENSG00000186092.6 0 0 0 0 0
ENSG00000284733.1 0 0 0 0 0
ENSG00000284662.1 0 0 0 0 0
ENSG00000187634.12 0 1 0 1 9
ENSG00000188976.11 123 256 128 134 138
ENSG00000187961.14 65 48 6 29 56

Mapping Quality Assessment

Number of reads sequenced per sample

In order to determine useful characteristics about our sample read mappings to inform proper normalization, we can begin by inspecting how many reads we have acquired per sample.

colSums(counts(DESeq.ds)) %>% 
  data.frame(sample = str_extract(names(.), '[0-9]{3}$'), reads = .) %>% 
  ggplot() + 
  geom_col(aes(x = sample, y = reads)) + 
  theme(axis.text.x = element_text(angle=-35))

Filtering Gene Counts

Moving forward, we will filter our DESeqDataSet as to only include genes that have a read mapped for at least one sample.

n_previous_genes <- nrow(DESeq.ds)
keep_genes <- rowSums(counts(DESeq.ds)) > 0
DESeq.ds <- DESeq.ds[keep_genes, ]

cat(n_previous_genes - nrow(DESeq.ds), 'genes removed due to no non-zero counts found')
981 genes removed due to no non-zero counts found

Read Count Normalization

In order to properly compare our sampled RNA read counts across samples, we must normalize our raw RNA read counts for both the sequencing depth and overall expression of other genes for each sample. This can be accomplished by using DESeq2’s size factor normalization, performed with estimateSizeFactors() and accessed with sizeFactors(). This step is particularly important as it is the first step of the DESeq2 differential expression analysis based on the Negative Binomial (Gamma-Poisson) distribution. More clearly, the full steps of the differential expression analysis will be

  1. Estimation of the size factors with DESeq2::estimateSizeFactors()
  2. Estimation of the dispersion with DESeq2::esimateDispersions()
  3. Fitting the Negative Binomial GLM model, Wald statistics, and p-values with nbinomWaldTest
DESeq.ds <- estimateSizeFactors(DESeq.ds)

plot(sizeFactors(DESeq.ds), colSums(counts(DESeq.ds)), 
     ylab = 'library sizes', xlab = 'size factors')

From the plot above, we can see that our key necessary assumptions that size factors should be around 1 and scale with number of reads per sample is met. However, this was not originally the case before some data filtering. See below the results of size factor normalization performed with no reads filtered, mitochondrial RNA reads filtered, and non-protein coding and mitochondrial RNA read filtered respectively.

Normalizing Read Counts with Size Factors

In the previous step, we have calculated the sizeFactors of our samples, and confirmed that the results match required expectations. Now we can use this result in order to properly normalize our samples. This is accomplished with the DESeq2::counts() accessor, with added parameter normalized = TRUE. Below we compare the filtered but non-normalized RNA read counts, with the filtered and normalized read counts.

p1 <-
  DESeq.ds %>% 
  counts() %>% 
  {. + 1} %>% 
  log2() %>% 
  as.data.frame() %>% 
  pivot_longer(cols = everything()) %>%
  mutate(name = str_extract(name, '[0-9]{3}$')) %>% 
  ggplot() + 
    geom_boxplot(aes(x = name, y = value)) +
    ggtitle("Non-normalized read counts") +
    xlab('Sample') + ylab("log2(read counts + 1)") + 
    theme(axis.text.x = element_text(angle = -30))


p2 <-
  DESeq.ds %>% 
  counts(normalize= TRUE) %>% 
  {. + 1} %>% 
  log2() %>% 
  as.data.frame() %>% 
  pivot_longer(cols = everything()) %>%
  mutate(name = str_extract(name, '[0-9]{3}$')) %>% 
  ggplot() + 
    geom_boxplot(aes(x = name, y = value)) +
    ggtitle("Size-factor-normalized read counts") +
    xlab('Sample') + ylab("log2(read counts + 1)") + 
    theme(axis.text.x = element_text(angle = -30))

p1 + p2

I would again like to show how the result of this normalization procedure has influenced the gene-filtering I have applied in previous step of the report. Below I show the results of the same normalization procedure as above, however with no gene filtering, Mitochondrial RNA filtering, and non-protein coding and Mitochondrial RNA filtering respectively.

No Gene Filtering Applied

Mitchondrial RNA Filtering Applied

Mitchondrial RNA and Non Protein-Coding Filtering Applied

Exploratory Data Analysis

In order to briefly explore possible relationships and patterns in our data, such as confounding variables or strength of separation between Diagnosis, we will apply DESeq2::rlog to the raw count data. The rlog function transforms the count data to the log2 scale, while also minimizing differences between samples for rows with small counts, and even normalizing with respect to library size ontop of that.

Note that in the previous section, we observed the sizeFactors(), resulting from size factor normalization that will be used for the differential expression analysis, however rlog normalized data will be used strictly for exploratory data analysis in this section of the report, before we return to the DESeq2 driven differential expression analysis in the next sections.

Below we apply DESeq2::rlog() to the filtered DESeq2 object that we have created previously, and plot the normalization results for matched gene-expression between two pairs of matched samples of which had an observable difference in raw read count numbers.

DESeq.rlog <- rlog(DESeq.ds, blind = TRUE)

log.norm.counts <-
  DESeq.ds %>% 
  counts(normalized=TRUE) %>% 
  {.+1} %>% 
  log2()


p1 <-
  data.frame('Female_Control_518' = log.norm.counts[,'SRR8440518'], 
           'Female_Control_529' = log.norm.counts[,'SRR8440529']) %>% 
  ggplot() + 
  geom_point(aes(x = Female_Control_518, y = Female_Control_529), 
             alpha = 0.01) + 
  geom_abline(slope = 1, intercept = 0, 
              lty = 2, col = 'red') + 
  ggtitle("size factor and log2-transformed") + 
  xlab('Female_Control_518') + ylab('Female_Control_529')

## the rlog-transformed counts are stored in the accessor "assay"
p2 <-
  data.frame('Female_Control_518' = assay(DESeq.rlog)[,'SRR8440518'], 
           'Female_Control_529' = assay(DESeq.rlog)[,'SRR8440529']) %>% 
  ggplot() + 
  geom_point(aes(x = Female_Control_518, y = Female_Control_529), 
             alpha = 0.01) + 
  geom_abline(slope = 1, intercept = 0, 
              lty = 2, col = 'red') + 
  ggtitle("rlog transformed") + 
  xlab('Female_Control_518') + ylab('Female_Control_529') 

p3 <-
  data.frame('Male_AD_477' = log.norm.counts[,'SRR8440477'], 
           'Male_AD_486' = log.norm.counts[,'SRR8440486']) %>% 
  ggplot() + 
  geom_point(aes(x = Male_AD_477, y = Male_AD_486), 
             alpha = 0.01) + 
  geom_abline(slope = 1, intercept = 0, 
              lty = 2, col = 'red') + 
  ggtitle("size factor and log2-transformed") + 
  xlab('Male_AD_477') + ylab('Male_AD_486')

## the rlog-transformed counts are stored in the accessor "assay"
p4 <-
  data.frame('Male_AD_477' = assay(DESeq.rlog)[,'SRR8440477'], 
           'Male_AD_486' = assay(DESeq.rlog)[,'SRR8440486']) %>% 
  ggplot() + 
  geom_point(aes(x = Male_AD_477, y = Male_AD_486), 
             alpha = 0.01) + 
  geom_abline(slope = 1, intercept = 0, 
              lty = 2, col = 'red') + 
  ggtitle("rlog transformed") + 
  xlab('Male_AD_477') + ylab('Male_AD_486') 



(p1 + p2) / (p3 + p4)

Mean vs Stdev before and after rlog

Another aspect of the normalized data that we can inspect is the relationship between the mean and standard deviation of our read counts. Due to the nature of sampled data, we will see that standard deviation will be higher for lowly expressed genes, and vice versa. This can make it difficult to compare these lowly expressed genes between samples. We see that the resultant rlog normalized data on the right below has a much better overall mean vs sd relationship.

log.norm.counts <- 
  DESeq.ds %>% 
  counts(normalized=TRUE) %>% 
  log2()

par(mfrow=c(1,1))

msd_plot <- 
  vsn::meanSdPlot(log.norm.counts,
                  ranks=FALSE, 
                  plot=FALSE)
p1 <- msd_plot$gg + 
  ggtitle("Sequencing depth normalized log2(read counts)") + 
  ylab("standard deviation")

rlog.norm.counts <- 
  DESeq.rlog %>% 
  assay()

msd_plot <- 
  vsn::meanSdPlot(rlog.norm.counts, 
                  ranks=FALSE, 
                  plot = FALSE)

p2 <- msd_plot$gg + 
  ggtitle("rlog transformation") + 
  coord_cartesian(ylim = c(0,3))

p1 + p2

PMI as a Possible Confounder

We first note that the original publication has already chosen samples in attempts to minimize intra-group variation.

To maximize the likelihood of observing differences between AD and control, we selectedonly AD specimens with high scores for amyloid and tau neuro-pathology in frontal cortex, and we selected only control specimens with negligible amounts of these pathologies in this region (see sample metadata in Data S2). AD and control groups had roughly matching distributions of age, sex, and post-mortem interval (PMI).

Post-Mortem-Interval, being the time since death that the sample was processed was not found in the original publication to be required to be modeled as a covariate between AD and Control samples, however for peace of mind we also see that PMI is fairly evenly distributed between the two groups.

accession_table %>% 
  ggplot() + 
    geom_histogram(aes(x = pmi), 
                   color = 'black', 
                   fill = 'skyblue') + 
  facet_wrap(vars(Diagnosis))

APOE as a Possible Confounder

We note that APOE genotype may play a role in AD, though are not including it in the analysis in order to compare with the reference study (APOE ε4: the most prevalent yet understudied risk factor for Alzheimer’s disease). Note that it was also better to not include APOE as DESeq2 will have a problem processing the variable in the design formula as not every level is represented in every group, observed below.

accession_table %>% 
  dplyr::count(Diagnosis, APOE) %>% 
  ggplot() + 
    geom_col(aes(x = APOE, y = n), 
                   color = 'black', 
                   fill = 'skyblue') + 
  facet_wrap(vars(Diagnosis))

Sex as a Possible Counfounder

First in order to check that there isn’t large differences between our male and female patient cohorts, we will generate a correlation heatmap using Pearson correlation.

The table you will see below will be in the following format

All Genes 500 Most Variable Genes
Heatmap will all genes Heatmap with selected genes
PCA with all genes PCA with selected genes
par(mfrow = c(2, 2))

most_variable_genes <-
  assay(DESeq.rlog) %>%  
  apply(1, var) %>% 
  sort(decreasing = TRUE) %>% 
  names(.) %>% 
  head(500)

rlog_corr_coeff <- 
  DESeq.rlog %>% 
  assay() %>% 
  cor(method = "pearson") 

selected_genes_rlog_corr_coeff <- 
  DESeq.rlog %>% 
  subset(rownames(.) %in% most_variable_genes) %>% 
  assay() %>% 
  cor(method = "pearson") 

new_names <-
  rownames(rlog_corr_coeff) %>% 
  sapply(function(run_name) 
    accession_table[accession_table['Run'] == run_name, 'sex']) %>% 
  {paste0(., '_', str_extract(names(.), '[0-9]{3}$'))}

rlog_corr_coeff <- 
  rlog_corr_coeff %>% 
  set_rownames(new_names) %>% 
  set_colnames(new_names)
  
new_names <-
  rownames(selected_genes_rlog_corr_coeff) %>% 
  sapply(function(run_name) 
    accession_table[accession_table['Run'] == run_name, 'sex']) %>% 
  {paste0(., '_', str_extract(names(.), '[0-9]{3}$'))}

selected_genes_rlog_corr_coeff <- 
  selected_genes_rlog_corr_coeff %>% 
  set_rownames(new_names) %>% 
  set_colnames(new_names)

Now that we have our rlog_corr_coeff and selected_genes_rlog_corr_coeff variables, we can plot our results side-by-side below.

par(mfrow = c(2,2))

rlog_corr_coeff %>% 
  {as.dist(1-., upper = TRUE)} %>%
  as.matrix %>% 
  pheatmap::pheatmap(., main = "Pearson correlation")

selected_genes_rlog_corr_coeff %>% 
  {as.dist(1-., upper = TRUE)} %>%
  as.matrix %>% 
  pheatmap::pheatmap(., main = "Pearson correlation")

DESeq.rlog %>% 
  plotPCA(intgroup = 'sex')

DESeq.rlog %>%
  subset(rownames(.) %in% most_variable_genes) %>% 
  plotPCA(intgroup = 'sex')

par(mfrow = c(1, 2), 
    mar=c(0, 4, 4, 2))


rlog_corr_coeff %>% 
  {as.dist(1-.)} %>% 
  hclust() %>%
  plot(., 
       main = "rlog transformed read counts", 
       xlab = "", ylab = "", sub = "")

selected_genes_rlog_corr_coeff %>% 
  {as.dist(1-.)} %>% 
  hclust() %>%
  plot(., 
       main = "rlog transformed read counts", 
       xlab = "",  ylab = "", sub = "")

Literature Supported Specific Gene-Set

From the results above, it is unclear if our groups separate by sex. In order to further determine if we should observe any differentially expressed genes between the sexes, we will only select gene features that were pointed out as literature supported in the reference study. This list is printed below as well for convenience

literature_supported_genes
 [1] "LOC102724661" "ANKRD26P3"    "MOV10L1"      "HIST2H2BA"    "ZBTB8B"      
 [6] "GLT1D1"       "TNFRSF21"     "CECR2"        "PDD6IPP2"     "PTPRZ1"      
[11] "IGSF10"       "GRIA2"        "MEIS1"        "SELENBP1"     "SERPINF1"    
[16] "RIMS2"        "ASTN1"        "TLN2"         "ZNF532"       "ZNF662"      
[21] "NIN"          "ULK3"         "SLC38A7"      "FOXP1"        "ARSA"        
[26] "CTBP1-AS"     "LOC102725328" "CBX6"         "GYPC"         "FBRSL1"      
[31] "SMAD7"        "PLXNC1"       "CLDN15"       "TSHZ3"        "KCNJ5"       
[36] "DPYD"         "STEAP3"       "RUNX3"        "PTPRG"        "ACD"         
[41] "TTYH3"        "LOC100133445" "LOC102724549" "VENTX"        "TM9SF1"      
[46] "SMIM3"        "ZNF703"       "TGFBI"        "EMP2"         "PSTPIP1"     
[51] "ZNF696"       "RFX2"         "APOE"         "ATOH8"        "ADAM8"       
[56] "GAS2L1"       "CHCHD5"       "ADAMTS13"     "A1BG"         "IL15"        
[61] "SECTM1"       "FAM109A"      "LOC100507639" "ZNF843"       "S100A4"      
[66] "LSR"         

Below we see that given the literature-supported genes, our Male and Female sex groups do seem to be seem to be the main source of separation in our data, though some patterns may exist. Moving forward the difference in sex of our cohort should not ruin our differential expression analysis between Control and AD groups, though it may be bet to model as a confounding factor as I have in this report.

Control vs AD Group Analysis

AD vs Control Correlation Heatmap

We can now generate the same Pearson correlation heatmap from above, though this time separating by our true variable of interest: AD vs Control.

column 1: all genes column 2: selected genes row 1: heatmap row 2: pca

par(mfrow = c(2, 2))

rlog_corr_coeff <- 
  DESeq.rlog %>% 
  assay() %>% 
  cor(method = "pearson") 

selected_genes_rlog_corr_coeff <- 
  DESeq.rlog %>% 
  subset(rownames(.) %in% most_variable_genes) %>% 
  assay() %>% 
  cor(method = "pearson") 

new_names <-
  rownames(rlog_corr_coeff) %>% 
  sapply(function(run_name) 
    accession_table[accession_table['Run'] == run_name, 'Diagnosis']) %>% 
  {paste0(., '_', str_extract(names(.), '[0-9]{3}$'))}

rlog_corr_coeff <- 
  rlog_corr_coeff %>% 
  set_rownames(new_names) %>% 
  set_colnames(new_names)
  
new_names <-
  rownames(selected_genes_rlog_corr_coeff) %>% 
  sapply(function(run_name) 
    accession_table[accession_table['Run'] == run_name, 'Diagnosis']) %>% 
  {paste0(., '_', str_extract(names(.), '[0-9]{3}$'))}

selected_genes_rlog_corr_coeff <- 
  selected_genes_rlog_corr_coeff %>% 
  set_rownames(new_names) %>% 
  set_colnames(new_names)

Now that we have our rlog_corr_coeff and selected_genes_rlog_corr_coeff variables, we can plot our results side-by-side below.

par(mfrow = c(1, 2))

rlog_corr_coeff %>% 
  {as.dist(1-., upper = TRUE)} %>%
  as.matrix %>% 
  pheatmap::pheatmap(., main = "Pearson correlation")

selected_genes_rlog_corr_coeff %>% 
  {as.dist(1-., upper = TRUE)} %>%
  as.matrix %>% 
  pheatmap::pheatmap(., main = "Pearson correlation")

DESeq.rlog %>% 
  plotPCA(intgroup = 'sex')

DESeq.rlog %>%
  subset(rownames(.) %in% most_variable_genes) %>% 
  plotPCA(intgroup = 'sex')

par(mfrow = c(1, 2), 
    mar=c(0, 4, 4, 2))

rlog_corr_coeff %>% 
  {as.dist(1-.)} %>% 
  hclust() %>%
  plot(., 
       main = "rlog transformed read counts", 
       xlab = "", ylab = "", sub = "")

selected_genes_rlog_corr_coeff %>% 
  {as.dist(1-.)} %>% 
  hclust() %>%
  plot(., 
       main = "rlog transformed read counts", 
       xlab = "",  ylab = "", sub = "")

Literature Supported Specific Gene-Set

From the results above, it is unclear if our groups are separable at all. In order to further determine if we should observe any differentially expressed genes, we will only select gene features that were pointed out as literature supported in the reference study. This list is printed below as well for convenience

literature_supported_genes
 [1] "LOC102724661" "ANKRD26P3"    "MOV10L1"      "HIST2H2BA"    "ZBTB8B"      
 [6] "GLT1D1"       "TNFRSF21"     "CECR2"        "PDD6IPP2"     "PTPRZ1"      
[11] "IGSF10"       "GRIA2"        "MEIS1"        "SELENBP1"     "SERPINF1"    
[16] "RIMS2"        "ASTN1"        "TLN2"         "ZNF532"       "ZNF662"      
[21] "NIN"          "ULK3"         "SLC38A7"      "FOXP1"        "ARSA"        
[26] "CTBP1-AS"     "LOC102725328" "CBX6"         "GYPC"         "FBRSL1"      
[31] "SMAD7"        "PLXNC1"       "CLDN15"       "TSHZ3"        "KCNJ5"       
[36] "DPYD"         "STEAP3"       "RUNX3"        "PTPRG"        "ACD"         
[41] "TTYH3"        "LOC100133445" "LOC102724549" "VENTX"        "TM9SF1"      
[46] "SMIM3"        "ZNF703"       "TGFBI"        "EMP2"         "PSTPIP1"     
[51] "ZNF696"       "RFX2"         "APOE"         "ATOH8"        "ADAM8"       
[56] "GAS2L1"       "CHCHD5"       "ADAMTS13"     "A1BG"         "IL15"        
[61] "SECTM1"       "FAM109A"      "LOC100507639" "ZNF843"       "S100A4"      
[66] "LSR"         

Below we see that given the literature-supported genes, our Control and AD groups do seem to be separable, and group well by Diagnosis condition. This indicates that we probably should see some differentially expressed genes in the next section.

Differential Expression Analysis

We will now continue with the differential expression analysis that we have already begun by estimating our size factors. To remind the reader, the steps we must take to successfully complete a DESeq2 differential expression analysis are the following:

  1. Estimation of the size factors with DESeq2::estimateSizeFactors()
  2. Estimation of the dispersion with DESeq2::esimateDispersions()
  3. Fitting the Negative Binomial GLM model, Wald statistics, and p-values with nbinomWaldTest

During this analysis, this tutorial from from Nicolas Terrapon, Denis Puthier and Jacques van Helden, as well as this tutorial on differential expression analysis and this tutorial on the Wald Test from the Harvard Chan Bioinformatics Core have served as an excellent reference material.

Estimating Dispersion

As previously noted, the next step in performing our differential expression analysis on our read count set is to estimate the gene-wise dispersion, which represent the variance in gene expression for a given mean expression value. Note that this means information is shared across genes of similar expression values in this step, as DESeq2 is making the assumption that genes with similar expression levels have similar dispersion. Also note that the process executed below (estimateDispersions()) is an instance of Maximum Likelihood estimation of dispersion to the count data. The result that we see plotted below is a red curve, indicating the expected dispersion value for genes of a given expression strength. Each black dot below is a gene with an associated man expression level and MLE of the dispersion.

DESeq.ds <- estimateDispersions(DESeq.ds)
plotDispEsts(DESeq.ds)

From the plot we see above, our gene-dispersion model seems to be a good fit, and we can continue on with differential expression analysis.

Negative Binomial Wald Test

We are now ready to actually perform the differential expression test. Below we apply the nbinomWaldTest to test the coefficients of a fit Negative Binomial GLM model against a normal distribution centered around 0.

DESeq.ds <- nbinomWaldTest(DESeq.ds)

resultDESeq2 <-
  results(DESeq.ds,
          independentFiltering = TRUE, 
          alpha = 0.05)

Distribution of Resultant P-values

Genes Found to be Differentially Expressed

padj log2FoldChange
SERPINF1 0.000041 -1.74
CECR2 0.000041 -1.69
TGFBI 0.004718 1.92
GLT1D1 0.005319 -2.41
MYOM3 0.011052 -4.16
FGL1 0.011338 -2.21
S100A4 0.019996 3.53
FMO1 0.019996 -4.98
PTPRG 0.019996 1.74
PGC 0.019996 -4.22
EVPL 0.019996 -4.52
APOE 0.019996 1.51
ARSA 0.019996 1.28
KDR 0.035990 -3.28
KLF18 0.037131 -5.31
ESRRG 0.037131 -2.05
TGM4 0.037131 -2.85
TECTB 0.037131 -5.73
KRT77 0.037131 -4.85
EPPIN 0.037131 -3.36
MOV10L1 0.037131 -2.54
NEXMIF 0.037131 -1.36
ASTN1 0.041759 -1.44
STRIP2 0.041759 -1.77
C2orf16 0.042241 -2.26
DPYD 0.042671 1.86
EDAR 0.042972 -2.90
DGKI 0.044330 -1.86
STEAP3 0.044854 1.48
PLEKHG1 0.044854 -1.29
USH2A 0.048952 -2.56
CCDC121 0.048952 -1.90
IGSF10 0.048952 -1.57
MCF2L2 0.048952 -1.14
OPRK1 0.048952 -2.66
RGS22 0.048952 -2.79
SORCS1 0.048952 -1.50
PSTPIP1 0.048952 1.89

Where do the study-shown differentially expressed genes fall in our differential expression table?

diff_exp_df %>%
  filter(gene_name %in% study_de_genes) %>% 
  # picking top-pvalue ensmbl-gene for each
  # hgnc gene-symbol identified gene
  group_by(gene_name) %>% 
  arrange(pvalue) %>% 
  filter(row_number() == 1) %>% 
  ungroup() %>%
  as.data.frame() %>% 
  set_rownames(.$gene_name) %>% 
  arrange(padj) %>% 
  select(padj, log2FoldChange) %>% 
  mutate(padj = format(padj, digits = 2, scientific = FALSE), 
         log2FoldChange = format(log2FoldChange, digits = 3)) %>%
  kable()
padj log2FoldChange
SERPINF1 0.000041 -1.7407
CECR2 0.000041 -1.6923
TGFBI 0.004718 1.9188
GLT1D1 0.005319 -2.4133
ARSA 0.019996 1.2752
PTPRG 0.019996 1.7361
S100A4 0.019996 3.5340
APOE 0.019996 1.5115
MOV10L1 0.037131 -2.5397
ASTN1 0.041759 -1.4381
DPYD 0.042671 1.8551
STEAP3 0.044854 1.4818
IGSF10 0.048952 -1.5660
PSTPIP1 0.048952 1.8852
MEIS1 0.059129 -1.6870
RFX2 0.059129 1.5431
TNFRSF21 0.091390 -1.9729
SELENBP1 0.091390 -1.5837
PLXNC1 0.093844 1.2515
SMIM3 0.107682 1.6002
KCNJ5 0.233905 1.3691
NIN 0.240714 -0.6251
TLN2 0.240923 -0.7447
ACD 0.240923 1.0323
LSR 0.272268 1.0618
VENTX 0.302894 0.9980
SECTM1 0.352615 2.1158
RIMS2 0.362129 -1.0546
IL15 0.366225 0.8582
ZNF662 0.380228 -0.7173
PTPRZ1 0.383543 -1.2478
FOXP1 0.406498 0.6839
ZBTB8B 0.429391 -1.6661
GRIA2 0.440735 -1.2523
ZNF532 0.446824 -0.6550
ADAM8 0.461923 1.3041
TSHZ3 0.505242 1.0101
TTYH3 0.509105 0.8394
FBRSL1 0.511080 0.6324
EMP2 0.539551 0.9036
SMAD7 0.550188 0.6461
RUNX3 0.553954 0.7313
ULK3 0.561424 0.4504
CBX6 0.567492 0.4916
SLC38A7 0.588607 0.4095
CLDN15 0.664546 0.6452
GYPC 0.701034 0.5979
ADAMTS13 0.735989 0.7492
GAS2L1 0.827675 0.4072
ATOH8 0.828128 0.6446
TM9SF1 0.878704 0.4394
ZNF703 0.886593 0.4055
A1BG 0.896338 0.4483
ZNF696 0.912327 0.3390
ZNF843 0.982390 0.0933
CHCHD5 0.992806 0.0474

Which genes have we confirmed from the study-selected genes to be differentially expressed?

 [1] "DPYD"     "S100A4"   "ASTN1"    "STEAP3"   "PTPRG"    "IGSF10"  
 [7] "TGFBI"    "GLT1D1"   "PSTPIP1"  "SERPINF1" "APOE"     "CECR2"   
[13] "MOV10L1"  "ARSA"    

Where do the literature-supported genes fall in our differential expression table?

padj log2FoldChange
SERPINF1 0.000041 -1.7407
CECR2 0.000041 -1.6923
TGFBI 0.004718 1.9188
GLT1D1 0.005319 -2.4133
ARSA 0.019996 1.2752
PTPRG 0.019996 1.7361
S100A4 0.019996 3.5340
APOE 0.019996 1.5115
MOV10L1 0.037131 -2.5397
ASTN1 0.041759 -1.4381
DPYD 0.042671 1.8551
STEAP3 0.044854 1.4818
IGSF10 0.048952 -1.5660
PSTPIP1 0.048952 1.8852
MEIS1 0.059129 -1.6870
RFX2 0.059129 1.5431
TNFRSF21 0.091390 -1.9729
SELENBP1 0.091390 -1.5837
PLXNC1 0.093844 1.2515
SMIM3 0.107682 1.6002
KCNJ5 0.233905 1.3691
NIN 0.240714 -0.6251
TLN2 0.240923 -0.7447
ACD 0.240923 1.0323
LSR 0.272268 1.0618
VENTX 0.302894 0.9980
SECTM1 0.352615 2.1158
RIMS2 0.362129 -1.0546
IL15 0.366225 0.8582
ZNF662 0.380228 -0.7173
PTPRZ1 0.383543 -1.2478
FOXP1 0.406498 0.6839
ZBTB8B 0.429391 -1.6661
GRIA2 0.440735 -1.2523
ZNF532 0.446824 -0.6550
ADAM8 0.461923 1.3041
TSHZ3 0.505242 1.0101
TTYH3 0.509105 0.8394
FBRSL1 0.511080 0.6324
EMP2 0.539551 0.9036
SMAD7 0.550188 0.6461
RUNX3 0.553954 0.7313
ULK3 0.561424 0.4504
CBX6 0.567492 0.4916
SLC38A7 0.588607 0.4095
CLDN15 0.664546 0.6452
GYPC 0.701034 0.5979
ADAMTS13 0.735989 0.7492
GAS2L1 0.827675 0.4072
ATOH8 0.828128 0.6446
TM9SF1 0.878704 0.4394
ZNF703 0.886593 0.4055
A1BG 0.896338 0.4483
ZNF696 0.912327 0.3390
ZNF843 0.982390 0.0933
CHCHD5 0.992806 0.0474

Volcano Plot Summarizing Results

alpha <- 0.05

cols <- densCols(resultDESeq2$log2FoldChange, 
                 -log10(resultDESeq2$padj))

plot(resultDESeq2$log2FoldChange, 
     -log10(resultDESeq2$padj), 
     col=cols, panel.first=grid(),
     main="Differential Expression Volcano plot", 
     xlab="Effect size: log2(fold-change)",
     ylab="-log10(adjusted p-value)",
     pch=resultDESeq2$pch, cex=0.4, 
     xlim = c(-6,6), 
     ylim = c(0, 1.05 * 
                max(-log10(resultDESeq2$padj), 
                  na.rm = TRUE))
     )

abline(v=0)
abline(v=c(-2,2), lty = 2, col = 'red')
abline(h=-log10(alpha), lty = 2, col = 'red')

# Plot the names of a reasonable number of genes, 
# by selecting those begin not only significant but 
# also having a strong effect size
genes_to_plot <- 
  sapply(abs(resultDESeq2$log2FoldChange) > 2 & 
          resultDESeq2$padj < alpha, isTRUE)

labels <-
  genes_to_plot %>% 
  {rownames(resultDESeq2)[.]} %>% 
  data.frame(query_gencode_id = .) %>% 
  left_join(gene_info_df, by = c('query_gencode_id' = 'gencode_id')) %>% 
  .$gene_name
  

if (length(labels) > 0) {
  text(resultDESeq2$log2FoldChange[genes_to_plot] + 
                  runif(length(which(genes_to_plot)), 0, 0.3),
       -log10(resultDESeq2$padj[genes_to_plot]) +  
                runif(length(which(genes_to_plot)), 0.01, 1.3),
       lab=labels, cex = 0.6)
}

Gene Ontology of Differentiallty Expressed Genes

# BiocManger::install('goseq')
library(goseq)

gene.vector <- 
  (row.names(resultDESeq2) %in% DGEgenes) %>%
  as.integer()

names(gene.vector) <- row.names(resultDESeq2)

names(gene.vector) <-
      gsub('\\.[0-9]+$', '', names(gene.vector))

pwf <- nullp(gene.vector,'hg19','ensGene', 
             plot.fit = FALSE)

# BiocManager::install('org.Hs.eg.db')
GO.wall <- goseq(pwf,'hg19','ensGene')

sig_GOs <- 
  subset(GO.wall, 
         over_represented_pvalue < 0.01) 

write.table(
  sig_GOs[,c("category","over_represented_pvalue")],
  file = "Enriched_GOterms_goseq.txt",
  quote = FALSE, row.names = FALSE, col.names = FALSE
)

Discussion

In this report, I have confirmed differential expression of 14 of 25 literature supported Alzheimer’s Disease associated genes. Below I provide a brief description of each gene based on NCBI gene information and the GeneCards Human Gene Database

Interesting Identified Genes

SERPINF1 (serpin family F member 1) – The encoded protein is secreted and strongly inhibits angiogenesis. In addition, this protein is a neurotrophic factor involved in neuronal differentiation in retinoblastoma cells. Mutations in this gene were found in individuals with osteogenesis imperfecta.

CECR2 (CECR2 histone acetyl-lysine reader) – Involved in chromatin remodeling, and may additionally play a role in DNA damage response. The encoded protein functions as part of an ATP-dependent complex that is involved in neurulation.

TGFBI (transforming growth factor beta induced ) – The protein is induced by transforming growth factor-beta and acts to inhibit cell adhesion. Mutations in this gene are associated with multiple types of corneal dystrophy.

GLT1D1 (glycosyltransferase 1 domain containing 1) – Protein Coding gene associated with Hepatocellular Carcinoma. Gene Ontology (GO) annotations related to this gene include transferase activity, transferring glycosyl groups. An important paralog of this gene is ALG2.

ARSA (arylsulfatase A) – Defects in this gene lead to metachromatic leucodystrophy (MLD), a progressive demyelination disease which results in a variety of neurological symptoms and ultimately death.

PTPRG (protein tyrosine phosphatase receptor type G) – Known to be signaling molecules that regulate a variety of cellular processes including cell growth, differentiation, mitotic cycle, and oncogenic transformation.

S100A4 ( S100 calcium binding protein A4) – S100 proteins are localized in the cytoplasm and/or nucleus of a wide range of cells, and involved in the regulation of a number of cellular processes such as cell cycle progression and differentiation.

APOE (apolipoprotein E) – Involved in the metabolism of fats in the body of mammals. A subtype is implicated in Alzheimer’s disease and cardiovascular disease.

MOV10L1 (Mov10 like RISC complex RNA helicase 1) – This gene is similar to a mouse gene that encodes a putative RNA helicase and shows testis-specific expression. Multiple transcript variants encoding different isoforms have been found for this gene.

ASTN1 (astrotactin 1) – Neuronal adhesion molecule required for glial-guided migration of young postmitotic neuroblasts in cortical regions of developing brain, including cerebrum, hippocampus, cerebellum, and olfactory bulb.

DPYD (dihydropyrimidine dehydrogenase) – Initial and rate-limiting factor in the pathway of uracil and thymidine catabolism.

STEAP3 (STEAP3 metalloreductase) – This protein may mediate downstream responses to p53, including promoting apoptosis. Deficiency in this gene can cause anemia.

IGSF10 (Immunoglobulin Superfamily Member 10) – Involved in the control of early migration of neurons expressing gonadotropin-releasing hormone (GNRH neurons) (By similarity). May be involved in the maintenance of osteochondroprogenitor cells pool (By similarity).

PSTPIP1 (proline-serine-threonine phosphatase interacting protein 1) – Found in association with the cytoskeleton in myeloid/monocytic cells and modulates immunoregulatory functions

Issues, Problems and Limitations

While analyzing this dataset, a key issue I have found is that it was sourced from frozen-tissue, causing degradation of some cell-resident RNA, and an overall enrichment in the samples for harder-to-degrade RNA like long-non-coding RNAs (lncRNAs) and Mitchondrial RNAs. In order to still provide confident results in the genes I have identified as differentially expressed, I have filtered for only non-Mitochondrial protein-coding genes. This solved many of the read count normalization hurdles, however limited the scope of my analysis and findings as well.

Key Datasets Generated

While performing this analysis I have generated a few key datasets resulting from various processes / commands. A table containing information on the type of datasets as well as how they were generated and can be found below.

Generated.By Dataset.Type Section Generated
TrimGalore Trimmed Reads Trimmed Reads
STAR Aligned Samples Aligned Samples
featureCounts featureCounts featureCounts
DESeq2 Differentially Expressed Genes DE Genes